Abstract

Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to 
the noncrossing Watson-Crick and G-U-base pairings (secondary structure) and additional cross-serial base pairs. 
These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In 
the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is 
of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. 
Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA are limited to RNA 
secondary structures, we present in this paper the inverse folding algorithm Inv which can deal with 3-noncrossing, 
canonical pseudoknot structures. 

Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including
pseudocodes. We show that Inv allows one to design, in particular, 3-noncrossing nonplanar RNA pseudoknot
structures, a class which is difficult to construct via dynamic programming routines. Inv is
freely available at http://www.combinatorics.cn/cbpc/inv.html



Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison
with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is
not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing
structures that potentially exhibit cross-serial interactions.

Figure 1: The pseudoknot structure of the glmS ribozyme pseudoknot P1.1 [7] as a diagram (top) and as
a planar graph (bottom).

1 Introduction 

Pseudoknots are structural elements of central importance in RNA structures [1], see Figure 1. They
represent cross-serial base pairing interactions between RNA nucleotides that are functionally important
in tRNAs, RNaseP [2], telomerase RNA [5], and ribosomal RNAs [4]. Pseudoknot structures are observed
in the mimicry of tRNA structures in plant virus RNAs as well as in the binding to the HIV-1 reverse
transcriptase in in vitro selection experiments [5]. Furthermore, basic mechanisms such as ribosomal
frame shifting involve pseudoknots [6].

Despite playing a key role in a variety of contexts, pseudoknots are excluded from large-scale
computational studies. Although the problem has attracted considerable attention in the last decade,
pseudoknots are considered a somewhat "exotic" structural concept. For all we know [8], the ab initio
prediction of general RNA pseudoknot structures is NP-complete, and the algorithmic difficulties of
pseudoknot folding are compounded by the fact that the thermodynamics of pseudoknots is far from being
well understood.

As for the folding of RNA secondary structures, Waterman et al. [9,10], Zuker et al. [11] and
Nussinov [12] established the dynamic programming (DP) folding routines. The first mfe-folding
algorithm for RNA secondary structures, however, dates back to the 60's [13–15]. For restricted
classes of pseudoknots, several algorithms have been designed: Rivas and Eddy [16], Dirks and
Pierce [17], Reeder and Giegerich [18] and Ren et al. [19]. Recently, a novel ab initio folding
algorithm Cross has been introduced [20]. Cross generates minimum free energy (mfe), 3-noncrossing,
3-canonical RNA structures, i.e. structures that do not contain three or more mutually crossing arcs
and in which each stack, i.e. sequence of parallel arcs, see eq. (1), has size greater than or equal
to three. In particular, in a 3-canonical structure there are no isolated arcs, see Figure 2.




Figure 2: $\sigma$-canonical RNA structures: each stack of "parallel" arcs has to have minimum
size $\sigma$. Here we display a 3-canonical structure consisting of three stacks (labeled Stack 1,
Stack 2 and Stack 3).

The notion of mfe-structure is based on a specific concept of pseudoknot loops and respective
loop-based energy parameters. This thermodynamic model was conceived by Tinoco and refined by
Freier, Turner, Ninio, and others [13,21–25].

1.1 $k$-noncrossing, $\sigma$-canonical RNA pseudoknot structures

Let us turn back the clock: three decades ago Waterman et al. [26], Nussinov et al. [12] and
Kleitman et al. in [23] analyzed RNA secondary structures. Secondary structures are coarse grained
RNA contact structures, see Figure 3.





Figure 3: The phenylalanine tRNA secondary structure 
represented as 2-noncrossing diagram (top) and as planar 
graph (bottom). 



Secondary structures can be represented as diagrams, i.e. labeled graphs over the vertex set
$[n] = \{1, \dots, n\}$ with vertex degrees $\le 1$, represented by drawing the vertices on a
horizontal line and the arcs $(i, j)$, where $i < j$, in the upper half-plane, see Figure 1 and
Figure 3.

Here, vertices and arcs correspond to the nucleotides A, G, U, C and to the Watson-Crick (A-U, G-C)
and (U-G) base pairs, respectively.



In a diagram, two arcs $(i_1, j_1)$ and $(i_2, j_2)$ are called crossing if
$i_1 < i_2 < j_1 < j_2$ holds. Accordingly, a $k$-crossing is a sequence of arcs
$(i_1, j_1), \dots, (i_k, j_k)$ such that
$i_1 < i_2 < \dots < i_k < j_1 < j_2 < \dots < j_k$, see Figure 5.
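The crossing condition can be checked mechanically. The following is a minimal sketch, not part of Inv or Cross: it assumes a diagram is given as a list of arcs `(i, j)` with `i < j` and distinct endpoints, and the function names are hypothetical. Since the arcs of a $k$-crossing are exactly $k$ pairwise crossing arcs, a brute-force search over subsets suffices for small examples.

```python
from itertools import combinations

def crosses(a, b):
    """True if the two arcs cross, i.e. i1 < i2 < j1 < j2 after ordering by start-point."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2

def max_crossing(arcs):
    """Largest k such that the arcs contain a k-crossing (brute force, small inputs only)."""
    arcs = sorted(arcs)
    best = 1 if arcs else 0
    for k in range(2, len(arcs) + 1):
        found = any(
            all(crosses(a, b) for a, b in combinations(subset, 2))
            for subset in combinations(arcs, k)
        )
        if not found:
            break  # every (k+1)-crossing would contain a k-crossing
        best = k
    return best

def is_k_noncrossing(arcs, k):
    """A diagram is k-noncrossing if it contains at most (k-1)-crossings."""
    return max_crossing(arcs) < k
```

For the three mutually crossing arcs (1, 7), (4, 9), (5, 11) this reports a 3-crossing, so the diagram is 4-noncrossing but not 3-noncrossing.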




Figure 4: Setting $k = 2$, we observe that secondary structures are a particular type of
$k$-noncrossing structures. They coincide with noncrossing diagrams having minimum arc-length two.

Figure 5: $k$-noncrossing diagrams: we display a 4-noncrossing diagram containing the three mutually
crossing arcs (1, 7), (4, 9), (5, 11) (drawn in red).



We call diagrams containing at most $(k-1)$-crossings $k$-noncrossing diagrams. RNA secondary
structures have no crossings in their diagram representation, see Figure 3 and Figure 4, and are
therefore 2-noncrossing diagrams. A structure in which any stack has at least size $\sigma$ is called
$\sigma$-canonical, where a stack of size $\sigma$ is a sequence of "parallel" arcs of the form

$((i, j), (i+1, j-1), \dots, (i+(\sigma-1), j-(\sigma-1)))$. (1)
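As an illustration of eq. (1), the sketch below groups the arcs of a diagram into maximal stacks and tests the $\sigma$-canonical property. It is a hedged example under the assumption that a structure is a list of arcs `(i, j)`; the helper names `stacks` and `is_sigma_canonical` are hypothetical and do not come from the Inv or Cross implementation.

```python
def stacks(arcs):
    """Group arcs into maximal stacks (i, j), (i+1, j-1), ..., listed from the outermost arc inwards."""
    arc_set = set(arcs)
    result = []
    for (i, j) in sorted(arc_set):
        if (i - 1, j + 1) in arc_set:
            continue  # (i, j) is not the outermost arc of its stack
        stack = [(i, j)]
        while (stack[-1][0] + 1, stack[-1][1] - 1) in arc_set:
            stack.append((stack[-1][0] + 1, stack[-1][1] - 1))
        result.append(stack)
    return result

def is_sigma_canonical(arcs, sigma):
    """True if every maximal stack contains at least sigma arcs."""
    return all(len(stack) >= sigma for stack in stacks(arcs))
```

A structure with an isolated arc yields a stack of size one and therefore fails the test for any $\sigma \ge 2$.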

As a natural generalization of RNA secondary structures, $k$-noncrossing RNA structures [28–30]
were introduced. A $k$-noncrossing RNA structure is a $k$-noncrossing diagram without arcs of the
form $(i, i+1)$. In the following we assume $k = 3$, i.e. in the diagram representation there are
at most two mutually crossing arcs, a minimum arc-length of four and a minimum stack-size of three
base pairs. The notion $k$-noncrossing stipulates that the complexity of a pseudoknot is related to
the maximal number of mutually crossing bonds. Indeed, most natural RNA pseudoknots are
3-noncrossing [31].

1.2 Neutral networks 

Before considering an inverse folding algorithm into specific RNA structures one has to have at
least some rationale as to why there exists a sequence realizing a given target as
mfe-configuration. In fact this is, on the level of entire folding maps, guaranteed by the
combinatorics of the target structures alone. It has been shown in [32] that the number of
3-noncrossing RNA pseudoknot structures satisfying the biophysical constraints grows asymptotically
as $c_3 n^{-5} 2.03^n$, where $c_3 > 0$ is some explicitly known constant. In view of the central
limit theorems of [55], this fact implies the existence of extended (exponentially large) sets of
sequences that all fold into one 3-noncrossing RNA pseudoknot structure, $S$. In other words, the
combinatorics of 3-noncrossing RNA structures alone implies that there are many sequences mapping
(folding) into a single structure. The set of all such sequences is called the neutral network¹ of
the structure $S$ [34,35], see Figure 6.

¹ The term "neutral network", as opposed to "neutral set", stems from giant component results of
random induced subgraphs of $n$-cubes. That is, neutral networks are typically connected in
sequence space.

Figure 6: Neutral networks in sequence space: we display sequence space (left) and structure space
(right) as grids. We depict a set of sequences that all fold into a particular structure. Any two of
these sequences are connected by a red edge. The neutral network of this fixed structure consists of
all sequences folding into it and is typically a connected subgraph of sequence space.

Figure 7: A structure and a particular compatible sequence organized in the segments of unpaired and
paired bases. In the displayed example the unpaired segment is (A,U,G,C,C,G) and the paired segment
is (AU,UA,GC,CG,UG,GU).

By construction, all the sequences contained in such a neutral network are compatible with $S$.
That is, at any two positions paired in $S$, we find two bases capable of forming a bond (A-U, U-A,
G-C, C-G, G-U and U-G), see Figure 7. Let $s'$ be a sequence derived via a mutation² of $s$. If
$s'$ is again compatible with $S$, we call this mutation "compatible".

² Note: we do not consider insertions or deletions.

Figure 8: Diagram representation of an RNA structure (top) and its induced compatible neighbors in
sequence space (bottom). Here the neighbors on the inner circle have Hamming distance one while
those on the outer circle have Hamming distance two. Note that each base pair gives rise to five
compatible neighbors (red), exactly one of which has Hamming distance one.

Let $C[S]$ denote the set of $S$-compatible sequences. The structure $S$ motivates a new adjacency
relation within $C[S]$. Indeed, we may reorganize a sequence $(s_1, \dots, s_n)$ into the pair

$((u_1, \dots, u_{n_u}), (p_1, \dots, p_{n_p}))$, (2)

where the $u_h$ denote the unpaired nucleotides and the $p_h = (s_i, s_j)$ denote base pairs,
respectively, see Figure 7. We can then view $s_u = (u_1, \dots, u_{n_u})$ and
$s_p = (p_1, \dots, p_{n_p})$ as elements of the formal cubes $Q_4^{n_u}$ and $Q_6^{n_p}$, implying
the new adjacency relation for elements of $C[S]$.
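The reorganization of eq. (2) and the compatibility test can be sketched as follows. This is an illustrative example only: it assumes 1-based positions, a structure given as a list of arcs `(i, j)`, and hypothetical helper names.

```python
# The six admissible base pairs (Watson-Crick and G-U), written as two-letter strings.
BASE_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def split_sequence(seq, arcs):
    """Return the pair ((u_1, ..., u_nu), (p_1, ..., p_np)) of eq. (2)."""
    paired_positions = {p for arc in arcs for p in arc}
    unpaired = tuple(
        seq[k - 1] for k in range(1, len(seq) + 1) if k not in paired_positions
    )
    paired = tuple(seq[i - 1] + seq[j - 1] for (i, j) in sorted(arcs))
    return unpaired, paired

def is_compatible(seq, arcs):
    """True if every arc of the structure joins two bases that can form a bond."""
    return all(seq[i - 1] + seq[j - 1] in BASE_PAIRS for (i, j) in arcs)
```

For the toy sequence `GAAAC` with the single arc (1, 5), the unpaired part is `('A', 'A', 'A')` and the paired part is `('GC',)`.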

Accordingly, there are two types of compatible neighbors in the sequence space, u- and
p-neighbors: a u-neighbor has Hamming distance one and differs exactly by a point mutation at an
unpaired position. Analogously, a p-neighbor differs by a compensatory base pair mutation, see
Figure 8.

Note, however, that a p-neighbor has either Hamming distance one (G-C ↦ G-U) or Hamming distance
two (G-C ↦ C-G). We call a u- or a p-neighbor, $y$, a compatible neighbor. In light of the
adjacency notion for the set of compatible sequences we call the set of all sequences folding into
$S$ the neutral network of $S$. By construction, the neutral network of $S$ is contained in $C[S]$.
If $y$ is contained in the neutral network we refer to $y$ as a neutral neighbor. This gives rise
to the compatible and neutral distance of two sequences, denoted by $C(s, s')$ and $N(s, s')$.
These are the minimum length of a $C[S]$-path and of a path in the neutral network between $s$ and
$s'$, respectively. Note that since each neutral path is in particular a compatible path, the
compatible distance is always less than or equal to the neutral distance.
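To make the two neighbor types concrete, the following sketch enumerates all compatible neighbors of a sequence with respect to a structure. It is an illustration under stated assumptions (1-based positions, arcs as `(i, j)` tuples, the input sequence already compatible with the structure); the function names are hypothetical.

```python
NUCLEOTIDES = "ACGU"
BASE_PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]

def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def compatible_neighbors(seq, arcs):
    """List ('u', s') for point mutations at unpaired positions and
    ('p', s') for base pair replacements at the arcs of the structure."""
    paired = {p for arc in arcs for p in arc}
    neighbors = []
    for k in range(1, len(seq) + 1):
        if k in paired:
            continue
        for x in NUCLEOTIDES:
            if x != seq[k - 1]:
                neighbors.append(("u", seq[:k - 1] + x + seq[k:]))
    for (i, j) in arcs:
        current = seq[i - 1] + seq[j - 1]
        for pair in BASE_PAIRS:
            if pair != current:
                s = list(seq)
                s[i - 1], s[j - 1] = pair[0], pair[1]
                neighbors.append(("p", "".join(s)))
    return neighbors
```

Each unpaired position contributes three u-neighbors and each arc contributes five p-neighbors, whose Hamming distance to the original sequence is one or two.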

In this paper we study the inverse folding problem for RNA pseudoknot structures: for a given
3-noncrossing target structure $S$, we search for sequences from $C[S]$ that have $S$ as mfe
configuration.

2 Background 

For RNA secondary structures, there are three different strategies for inverse folding:
RNAinverse, RNA-SSD and INFO-RNA [36–38].

They all iteratively generate, via a local search routine, sequences whose structures have smaller
and smaller distances to a given target. Here the distance between two structures is obtained by
aligning them as diagrams and counting "0" if a given position is either unpaired or incident to an
arc contained in both structures, and "1" otherwise, see Figure 9.
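The distance just described amounts to counting the positions whose pairing partner differs between the two structures. A minimal sketch, assuming structures are lists of arcs over positions 1..n and using hypothetical function names:

```python
def pairing_table(arcs, n):
    """table[w] is the partner of position w, or 0 if w is unpaired (index 0 is unused)."""
    table = [0] * (n + 1)
    for i, j in arcs:
        table[i] = j
        table[j] = i
    return table

def structure_distance(arcs1, arcs2, n):
    """Number of positions that are paired differently in the two structures."""
    p1 = pairing_table(arcs1, n)
    p2 = pairing_table(arcs2, n)
    return sum(1 for w in range(1, n + 1) if p1[w] != p2[w])
```

For example, with the hypothetical structures {(4, 20)} and {(4, 17)} over 20 positions, the positions 4, 17 and 20 are counted, giving distance three.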

One common assumption in these inverse folding algorithms is that the energies of specific
substructures contribute additively to the energy of the entire structure. Let us proceed by
analyzing the algorithms.

Figure 9: Positions paired differently in $S_1$ and $S_2$ are assigned a "1". There are two types
of positions: I. $p$ is contained in different arcs, see position 4, $(4, 20) \in S_1$ and
$(4, 17) \in S_2$. II. $p$ is unpaired in one structure and paired in the other, such as
position 18.

RNAinverse is the first inverse-folding algorithm that derives sequences realizing given RNA
secondary structures as mfe-configuration. In its initialization step, a random compatible sequence
$s$ for the target $T$ is generated. Then RNAinverse proceeds by updating the sequence $s$ to
$s', s'', \dots$ step by step, minimizing the structure distance between the mfe structure of $s'$
and the target structure $T$. Based on the observation that the energy of a substructure
contributes additively to the mfe of the molecule, RNAinverse optimizes "small" substructures
first, eventually extending these to the entire structure. While optimizing substructures,
RNAinverse performs an adaptive walk in order to decrease the structure distance. In fact, this
walk is based entirely on random compatible mutations.

RNA-SSD first assigns specific probabilities to the bases located in unpaired positions and to the
base pairs (G-C, A-U, U-G) of $T$, respectively. In this assignment the probability of an unpaired
position being assigned either A or U is greater than that of being assigned G or C. Similarly, the
probability of the base pairs G-C and C-G is greater than that of the other base pairs. Then,
RNA-SSD derives a hierarchical decomposition of the target structure. It recursively splits the
structure and thereby derives a binary decomposition tree rooted in $T$, whose leaves correspond to
$T$-substructures. Each non-leaf node of this tree represents a substructure obtained by merging
the two substructures of its respective children. Given this tree, RNA-SSD performs a stochastic
local search, starting at the leaves and subsequently working its way up to the root.

INFO-RNA employs a dynamic programming method for finding a well suited initial sequence. This
sequence has lowest energy with respect to $T$. Since it does not necessarily fold into $T$ (due to
potentially existing competing configurations), INFO-RNA then utilizes an improved³ stochastic
local search in order to find a sequence in the neutral network of $T$. In contrast to RNAinverse,
INFO-RNA allows for increasing the distance to the target structure. At the same time, only
positions that do not pair correctly and positions adjacent to these are examined.

³ Relative to the local search routine used in RNAinverse.

2.1 Cross 

Cross is an ab initio folding algorithm that maps RNA sequences into 3-noncrossing RNA structures.
It is guaranteed to search all 3-noncrossing, $\sigma$-canonical structures and derives some (not
necessarily unique), loop-based mfe-configuration. In the following we always assume
$\sigma \ge 3$. The input of Cross is an arbitrary RNA sequence $s$ and an integer $N$. Its output
is a list of $N$ 3-noncrossing, $\sigma$-canonical structures, the first of which is the
mfe-structure for $s$. This list of $N$ structures $(C_0, C_1, \dots, C_{N-1})$ is ordered by free
energy and the first list-element, the mfe-structure, is denoted by Cross($s$). If no $N$ is
specified, Cross assumes $N = 1$ as default.

Cross generates an mfe-structure based on specific loop-types of 3-noncrossing RNA structures. For
a given structure $S$, let $\alpha$ be an arc contained in $S$ ($S$-arc) and denote the set of
$S$-arcs that cross $\alpha$ by $\mathcal{A}_S(\alpha)$.

For two arcs $\alpha = (i, j)$ and $\alpha' = (i', j')$, we next specify the partial order over the
set of arcs:

$\alpha' \prec \alpha$ if and only if $i < i' < j' < j$.

All notions of minimal or maximal elements are understood to be with respect to $\prec$. An arc
$\alpha \in \mathcal{A}_S(\beta)$ is called a minimal $\beta$-crossing if there exists no
$\alpha' \in \mathcal{A}_S(\beta)$ such that $\alpha' \prec \alpha$. Note that
$\alpha \in \mathcal{A}_S(\beta)$ can be minimal $\beta$-crossing, while $\beta$ is not minimal
$\alpha$-crossing.

Figure 10: The standard loop-types: hairpin-loop (top), interior-loop (middle) and multi-loop
(bottom). These represent all loop-types that occur in RNA secondary structures.

3-noncrossing diagrams exhibit the following four basic loop-types:

(1) A hairpin-loop is a pair

$((i, j), [i+1, j-1])$

where $(i, j)$ is an arc and $[i, j]$ is an interval, i.e. a sequence of consecutive vertices
$(i, i+1, \dots, j-1, j)$.

(2) An interior-loop is a sequence

$((i_1, j_1), [i_1+1, i_2-1], (i_2, j_2), [j_2+1, j_1-1])$,

where $(i_2, j_2)$ is nested in $(i_1, j_1)$. That is, we have $i_1 < i_2 < j_2 < j_1$.

(3) A multi-loop, see Figure 10 [20], is a sequence

$((i_1, j_1), [i_1+1, \omega_1-1], S_{\omega_1}^{\tau_1}, [\tau_1+1, \omega_2-1], S_{\omega_2}^{\tau_2}, \dots)$,

where $S_{\omega_h}^{\tau_h}$ denotes a pseudoknot structure over $[\omega_h, \tau_h]$ (i.e. nested
in $(i_1, j_1)$) and subject to the following condition: if all
$S_{\omega_h}^{\tau_h} = (\omega_h, \tau_h)$, i.e. all substructures are just arcs, for all $h$,
then we have $h \ge 2$.

(4) A pseudoknot, see Figure 11 [20], consists of the following data:

(P1) A set of arcs

$P = \{(i_1, j_1), (i_2, j_2), \dots, (i_t, j_t)\}$,

where $i_1 = \min\{i_h\}$ and $j_t = \max\{j_h\}$, such that

(i) the diagram induced by the arc-set $P$ is irreducible, i.e. the dependency-graph of $P$ (i.e.
the graph having $P$ as vertex set and in which $\alpha$ and $\alpha'$ are adjacent if and only if
they cross) is connected and

(ii) for each $(i_h, j_h) \in P$ there exists some arc $\beta$ (not necessarily contained in $P$)
such that $(i_h, j_h)$ is minimal $\beta$-crossing.

(P2) Any $i_1 < x < j_t$, not contained in hairpin-, interior- or multi-loops.

Figure 11: Pseudoknot loops, formed by all blue vertices and arcs.

Having discussed the basic loop-types, we are now in position to state

Theorem 1 Any 3-noncrossing RNA pseudoknot structure has a unique loop-decomposition [20].

Figure 12 illustrates the loop decomposition of a 3-noncrossing structure.
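The set of arcs crossing a given arc and the notion of a minimal $\beta$-crossing, which enter condition (ii) of the pseudoknot definition, can be made concrete as follows. This is a sketch with hypothetical function names, assuming a structure is a list of arcs `(i, j)`; it is not taken from the Cross implementation.

```python
def crosses(a, b):
    """True if the two arcs cross (i1 < i2 < j1 < j2 after ordering by start-point)."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2

def nested_in(inner, outer):
    """True if inner = (i', j') and outer = (i, j) satisfy i < i' < j' < j."""
    return outer[0] < inner[0] < inner[1] < outer[1]

def crossing_arcs(arcs, beta):
    """All arcs of the structure that cross beta."""
    return [a for a in arcs if crosses(a, beta)]

def minimal_crossings(arcs, beta):
    """Arcs crossing beta that have no other beta-crossing arc nested inside them."""
    candidates = crossing_arcs(arcs, beta)
    return [a for a in candidates if not any(nested_in(b, a) for b in candidates)]
```

In the toy structure {(1, 10), (5, 15), (6, 14)}, the arc (1, 10) is minimal (5, 15)-crossing, whereas (5, 15) is not minimal (1, 10)-crossing because (6, 14) is nested in it and also crosses (1, 10).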

A motif in Cross is a 3-noncrossing structure, having only $\prec$-maximal stacks of size exactly
$\sigma$, see Figure 13. A skeleton, $S$, is a $k$-noncrossing structure such that

• its core, $c(S)$, has no noncrossing arcs and

• its L-graph, $L(S)$, is connected.

Here the core of a structure, $c(S)$, is obtained by collapsing its stacks into single arcs
(thereby reducing its length) and the graph $L(S)$ is obtained by mapping arcs into vertices and
connecting any two if they cross in the diagram representation of $S$, see Figure 14.

Figure 12: Loop decomposition: here a hairpin-loop (I), an interior-loop (II), a multi-loop (III)
and a pseudoknot (IV).

Figure 13: Motif: a 3-noncrossing, 3-canonical motif.

Figure 14: Skeleton and its L-graph: we display a skeleton (left) and its L-graph (right).

As for the general strategy, Cross constructs 3-noncrossing RNA structures "from top to bottom" via
three subroutines:



I (Shadow): Here we generate all maximal stacks of the structure. Note that a stack is maximal with
respect to $\prec$ if it is not nested in some other stack. This is derived by "shadowing" the
motifs, i.e. their $\sigma$-stacks are extended "from top to bottom".

II (SkeletonBranch): Given a shadow, the second step of Cross consists in generating the
skeleta-tree. The nodes of this tree are particular 3-noncrossing structures, obtained by
successive insertions of stacks. Intuitively, a skeleton encapsulates all cross-serial arcs that
cannot be recursively computed. Here the tree complexity is controlled by limiting the (total)
number of pseudoknots.

III (Saturation): In the third subroutine each skeleton is saturated via DP-routines. After the
saturation the mfe 3-noncrossing structure is derived.

Figure 15 provides an overview of how the three subroutines are combined.
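The skeleton condition defined earlier in this section (a core in which every arc crosses some other arc, together with a connected L-graph) can be tested with a few lines of code. The sketch below is an assumption-laden illustration: it represents the core by the outermost arc of each stack without relabeling the vertices, which preserves all crossing relations, and all function names are hypothetical.

```python
from collections import deque

def crosses(a, b):
    """True if the two arcs cross (i1 < i2 < j1 < j2 after ordering by start-point)."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2

def core_arcs(arcs):
    """Collapse each stack to its outermost arc (vertices are not relabeled)."""
    arc_set = set(arcs)
    return sorted(a for a in arc_set if (a[0] - 1, a[1] + 1) not in arc_set)

def l_graph(arcs):
    """Map every arc to the set of arcs it crosses."""
    return {a: {b for b in arcs if crosses(a, b)} for a in arcs}

def is_connected(graph):
    """Breadth-first search connectivity test; an empty graph counts as not connected."""
    if not graph:
        return False
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        for nb in graph[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(graph)

def is_skeleton(arcs):
    """Every core arc crosses some other arc and the L-graph of the core is connected."""
    graph = l_graph(core_arcs(arcs))
    return all(graph[a] for a in graph) and is_connected(graph)
```

Two crossing stacks form a skeleton under this test, while adding an arc that crosses nothing, or passing a secondary structure, makes the test fail.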

3 The algorithm 

The inverse folding algorithm Inv is based on the ab initio folding algorithm Cross. The input of
Inv is the target structure, $T$. The latter is expressed as a character string over ":()[]{}",
where ":" denotes an unpaired base and "()", "[]", "{}" denote paired bases.
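A target string of this form can be turned into a list of arcs with one stack per bracket type. The parser below is a minimal sketch; it assumes that each bracket type is matched independently of the others (as in extended dot-bracket notation), which the text above does not state explicitly, and the function name is hypothetical.

```python
def parse_structure(structure):
    """Convert a string over ':()[]{}' into a sorted list of arcs (i, j), positions 1-based."""
    closers = {")": "(", "]": "[", "}": "{"}
    open_positions = {"(": [], "[": [], "{": []}
    arcs = []
    for pos, ch in enumerate(structure, start=1):
        if ch == ":":
            continue  # unpaired base
        if ch in open_positions:
            open_positions[ch].append(pos)
        elif ch in closers:
            pending = open_positions[closers[ch]]
            if not pending:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            arcs.append((pending.pop(), pos))
        else:
            raise ValueError(f"invalid character {ch!r} at position {pos}")
    if any(open_positions.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(arcs)
```

For instance, `parse_structure("((:[[:))::]]")` yields the arcs (1, 8), (2, 7), (4, 12) and (5, 11), in which the square-bracket arcs cross the round-bracket arcs.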

Algorithm 1 presents the pseudocode of Inv. After validation of the target structure (lines 2 to 5
in Algorithm 1), similar to INFO-RNA, Inv constructs an initial sequence and then proceeds by a
stochastic local search based on the loop decomposition of the target. The initial sequence is
derived via the routine Adjust-Seq. We then decompose the target structure into loops and endow
these with a linear order. According to this order we use the routine Local-Search in order to find
for each loop a "proper" local solution.

Figure 15: An outline of Cross (for illustration purposes we assume here $\sigma = 1$): The
routines Shadow, SkeletonBranch and Saturation are depicted. Due to space limitations we only
represent a few select motifs and for the same reason only one of the motifs displayed in the first
row is extended by one arc (drawn in blue). Furthermore note that only motifs with crossings give
rise to nontrivial skeleton-trees, all other motifs are considered directly as input for
Saturation.

3.1 Adjust-Seq 

In this section we describe Steps 2 and 3 of the pseudocode presented in Algorithm 1. The routine
Make-Start, see line 8, generates a random sequence, start, which is compatible with the target,
with uniform probability.
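One way to draw a compatible sequence uniformly at random is to choose every unpaired nucleotide and every base pair independently and uniformly, since the set of compatible sequences is the product of these choices. The sketch below illustrates this idea; it is not the authors' Make-Start implementation, and the function name and signature are assumptions.

```python
import random

NUCLEOTIDES = "ACGU"
BASE_PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]

def make_start(arcs, n, rng=random):
    """Draw a sequence of length n that is compatible with the arcs, uniformly at random."""
    seq = [None] * n
    for i, j in arcs:
        pair = rng.choice(BASE_PAIRS)          # one of the six admissible pairs
        seq[i - 1], seq[j - 1] = pair[0], pair[1]
    for k in range(n):
        if seq[k] is None:
            seq[k] = rng.choice(NUCLEOTIDES)   # unpaired position
    return "".join(seq)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.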

We then initialize the variable seq_min via the sequence start and set the variable
$d_{\min} = +\infty$, where $d_{\min}$ denotes the structure distance between Cross(seq_min)
and $T$.

Algorithm 1 Inv
Input: $k$-noncrossing target structure $T$
Output: an RNA sequence seq
Require: $k \le 3$ and $T$ is composed with ":()[]{}"
Ensure: Cross(seq) = $T$
 1: ▷ Step 1: Validate structure
 2: if false = Check-Stru($T$) then
 3:     print incorrect structure
 4:     return NIL
 5: end if
 6:
 7: ▷ Step 2: Generate the start sequence
 8: start ← Make-Start($T$)
 9:
10: ▷ Step 3: Adjust the start sequence
11: seq_middle ← Adjust-Seq(start, $T$)
12:
13: ▷ Step 4: Decompose $T$ and derive the ordered intervals.
14: Interval array $I$
15: $m$ ← $|I|$    ▷ $I$ satisfies $I_m = T$
16:
17: ▷ Step 5: Stochastic Local Search
18: seq ← seq_middle
19: for all intervals $I_w$ in the array $I$ do
20:     $l$ ← start-point($I_w$)
21:     $r$ ← end-point($I_w$)
22:     $s'$ ← seq$|_{[l,r]}$    ▷ get sub-sequence
23:     seq$|_{[l,r]}$ ← Local-Search($s'$, $I_w$)
24: end for
25:
26: ▷ Step 6: output
27: if $T$ = Cross(seq) then
28:     return seq
29: else
30:     print Failed!
31:     return NIL
32: end if

Given the sequence start, we construct a set of potential "competitors", $C$, i.e. a set of
structures suited as folding targets for start. In Algorithm 2 we show how to adjust the start
sequence using the routine Adjust-Seq. Lines 3 to 36 of Algorithm 2 contain a For-loop, executed at
most $\sqrt{n}/2$ times. Here the loop-length $\sqrt{n}/2$ is heuristically determined.

Setting the Cross-parameter⁴, $N$, the subroutine executed in the loop-body consists of the
following three steps.

Step I. Generating $C^0(\lambda^{i-1})$ via Cross. Suppose we are in the $i$th step of the For-loop
and are given the sequence $\lambda^{i-1}$, where $\lambda^0 = $ start. We consider
Cross($\lambda^{i-1}$, $N$), i.e. the list of suboptimal structures with respect to
$\lambda^{i-1}$,

$C^0(\lambda^{i-1}) = \mathrm{Cross}(\lambda^{i-1}, N) = (C^0_h(\lambda^{i-1}))_{h=0}^{N-1}$.

If $C^0_0(\lambda^{i-1}) = T$, then Inv returns $\lambda^{i-1}$. Else, in case of
$d(C^0_0(\lambda^{i-1}), T) < d_{\min}$, we set

seq_min $= \lambda^{i-1}$, $d_{\min} = d(C^0_0(\lambda^{i-1}), T)$.

Otherwise we do not update seq_min and go directly to Step II.

Step II. The competitors. We introduce a specific procedure that "perturbs" arcs of a given RNA
pseudoknot structure, $S$. Let $\alpha$ be an arc of $S$ and let $l(\alpha)$, $r(\alpha)$ denote
the start- and end-point of $\alpha$. A perturbation of $\alpha$ is a procedure which generates a
new arc $\alpha'$, such that

$|l(\alpha) - l(\alpha')| \le 1$ and $|r(\alpha) - r(\alpha')| \le 1$.

Clearly, there are nine perturbations of any given arc $\alpha$ (including $\alpha$ itself), see
Figure 16.

⁴ For all computer experiments we set $N = 50$.

Figure 16: Nine perturbations of an arc $(i, j)$. Original arcs are drawn dotted, and the arcs
incident to red bases are the perturbations.

We proceed by keeping $\alpha$, replacing the arc $\alpha$ by a nontrivial perturbation or removing
$\alpha$, arriving at a set of ten structures $\nu(S, \alpha)$.
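The perturbation procedure can be sketched directly from its definition. The code below is illustrative only: it assumes structures are lists of arcs, uses the hypothetical names `perturbations` and `nu`, and deliberately performs no validity checks, since invalid candidates are removed in a later filtering step.

```python
def perturbations(arc):
    """The nine arcs whose start- and end-points differ from those of arc by at most one."""
    i, j = arc
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def nu(arcs, arc):
    """Ten candidate structures: arc kept, arc replaced by one of its eight
    nontrivial perturbations, or arc removed."""
    others = [a for a in arcs if a != arc]
    candidates = [sorted(others + [p]) for p in perturbations(arc)]
    candidates.append(sorted(others))  # the structure with the arc removed
    return candidates
```

Applying `nu` to every arc of every suboptimal structure and collecting the results gives the raw candidate set from which the competitors are selected.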

Now we use this method in order to generate the set $C^1(\lambda^{i-1})$ by perturbing each arc of
each structure $C^0_h(\lambda^{i-1}) \in C^0(\lambda^{i-1})$. If $C^0_h(\lambda^{i-1})$ has $A_h$
arcs, $\{\alpha_h^1, \dots, \alpha_h^{A_h}\}$, then

$C^1(\lambda^{i-1}) = \bigcup_{h=0}^{N-1} \bigcup_{j=1}^{A_h} \nu(C^0_h(\lambda^{i-1}), \alpha_h^j)$.

This construction may result in duplicate, inconsistent or incompatible structures. Here, a
structure is inconsistent if there exists at least one position paired with more than one base, and
incompatible if there exists at least one arc not compatible with $\lambda^{i-1}$, see Figures 17
and 18. Here compatibility is understood with respect to the Watson-Crick and G-U base pairing
rules.

Figure 17: Inconsistent structures: the dotted arc is perturbed by shifting its end-point. This
perturbation leads to a nucleotide establishing two base pairs, which is impossible.

Figure 18: Incompatible structures: we display a perturbation of the dotted arc leading to a
structure that is incompatible with the given sequence.

Deleting inconsistent and incompatible structures, as well as those identical to the target, we
arrive at the set of competitors,

$C(\lambda^{i-1}) \subset C^1(\lambda^{i-1})$.
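The filtering step can be illustrated as follows. This is a sketch under stated assumptions: candidates are lists of arcs, positions are 1-based, and the additional bounds check (arcs must satisfy 1 ≤ i < j ≤ n) is an assumption of this example that the text does not spell out. All function names are hypothetical.

```python
BASE_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def is_consistent(arcs):
    """No position is paired with more than one base."""
    positions = [p for arc in arcs for p in arc]
    return len(positions) == len(set(positions))

def is_compatible(seq, arcs):
    """Every arc joins two bases that can pair (Watson-Crick or G-U)."""
    return all(seq[i - 1] + seq[j - 1] in BASE_PAIRS for i, j in arcs)

def in_bounds(arcs, n):
    """Assumed extra check: every arc satisfies 1 <= i < j <= n."""
    return all(1 <= i < j <= n for i, j in arcs)

def competitors(candidates, seq, target):
    """Drop duplicates, the target itself, and inconsistent or incompatible candidates."""
    target_key = tuple(sorted(target))
    seen, result = set(), []
    for arcs in candidates:
        key = tuple(sorted(arcs))
        if key == target_key or key in seen:
            continue
        seen.add(key)
        if in_bounds(key, len(seq)) and is_consistent(key) and is_compatible(seq, key):
            result.append(list(key))
    return result
```

With the toy sequence `GAAAUC` and target {(1, 6)}, only the candidate {(1, 5)} (a G-U pair) survives among the candidates used in the accompanying check.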

Figure 19: Mutation: Suppose the top and middle structures represent the set of competitors and the
bottom structure is the target. We display $\lambda^{i-1}$ (top sequence) and its mutation,
$\lambda^i$ (bottom sequence). Two nucleotides of base pairs not contained in $T$ are colored
green, nucleotides subject to mutations are colored red.

Step III. Mutation. Here we adjust $\lambda^{i-1}$ with respect to $T$ as well as the set of
competitors, $C(\lambda^{i-1})$, derived in the previous step. Suppose
$\lambda^{i-1} = s_1^{i-1} s_2^{i-1} \dots s_n^{i-1}$. Let $p(S, w)$ be the position paired to the
position $w$ in the RNA structure $S \in C(\lambda^{i-1})$, or 0 if position $w$ is unpaired. For
instance, in Figure 19 we have $p(T, 1) = 4$, $p(T, 2) = 0$ and $p(T, 4) = 1$. For each position
$w$ of the target $T$, if there exists a structure $C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$ such
that $p(C_h(\lambda^{i-1}), w) \ne p(T, w)$ (see positions 5, 6, 9, and 11 in Figure 19) we modify
$\lambda^{i-1}$ as follows:

1. unpaired position: If $p(T, w) = 0$, we update $s_w^{i-1}$ randomly into the nucleotide
$s_w^i$, such that for each $C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$, either
$p(C_h(\lambda^{i-1}), w) = 0$ or $s_w^i$ is not compatible with $s_v^{i-1}$, where
$v = p(C_h(\lambda^{i-1}), w) > 0$. See position 6 in Figure 19.

2. start-point: If $p(T, w) > w$, set $v = p(T, w)$. We randomly choose a compatible base pair
$(s_w^i, s_v^i)$ different from $(s_w^{i-1}, s_v^{i-1})$, such that for each
$C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$, either $p(C_h(\lambda^{i-1}), w) = 0$ or $s_w^i$ is not
compatible with $s_u^{i-1}$, where $u = p(C_h(\lambda^{i-1}), w) > 0$ is the end-point paired with
$s_w^{i-1}$ in $C_h(\lambda^{i-1})$ (see the arc (5, 9) in Figure 19: the pair G-C retains the
compatibility with (5, 9), but is incompatible with (5, 10)). By Figure 20 we show feasibility of
this step.

3. end-point: If $0 < p(T, w) < w$, then by construction the nucleotide has already been considered
in the previous step.

Figure 20: Mutations are always possible: suppose $p$ is paired with $q$ in $T$ and $p$ is paired
with $q_1$ in one competitor and $q_2$ in another one. For a fixed nucleotide at $p$ there are at
most two scenarios, since a base can pair with at most two different bases. For instance, for G we
have the pairs G-C, G-U. We display all nucleotide configurations (LHS) and their corresponding
solutions (RHS).

Therefore, updating all the nucleotides of $\lambda^{i-1}$, we arrive at the new sequence
$\lambda^i = s_1^i s_2^i \dots s_n^i$.
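Case 1 of the mutation step (a position that is unpaired in the target) can be sketched as follows. This is an illustration, not the authors' code: the helper names are hypothetical, positions are 1-based, and the fallback of leaving the sequence unchanged when no admissible nucleotide exists is an assumption of this example.

```python
import random

BASE_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def partner(arcs, w):
    """p(S, w): the position paired with w in the structure, or 0 if w is unpaired."""
    for i, j in arcs:
        if i == w:
            return j
        if j == w:
            return i
    return 0

def allowed_unpaired(seq, w, competitors):
    """Nucleotides at position w that cannot pair with any competitor partner of w."""
    partner_bases = {
        seq[partner(c, w) - 1] for c in competitors if partner(c, w) > 0
    }
    return [x for x in "ACGU" if all(x + y not in BASE_PAIRS for y in partner_bases)]

def mutate_unpaired(seq, w, competitors, rng=random):
    """Replace the nucleotide at position w by a random admissible one.
    Assumed fallback: return seq unchanged if no nucleotide is admissible."""
    choices = allowed_unpaired(seq, w, competitors)
    if not choices:
        return seq
    return seq[:w - 1] + rng.choice(choices) + seq[w:]
```

For the toy sequence `GAAAUC` and a competitor containing the arc (1, 5), position 5 may only become A or G, since C and U could pair with the G at position 1.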

Note that the above mutation steps heuristically decrease the structure distance. However, the
resulting sequence is not necessarily incompatible with all competitors. For instance, consider a
competitor $C_h$ whose arcs are all contained in $T$. Since $\lambda^i$ is compatible with $T$,
$\lambda^i$ is also compatible with $C_h$. Since competitors are obtained from suboptimal folds
such a scenario may arise.

In practice, this situation does not represent a problem, since competitors of this type are likely
to be ruled out by virtue of the fact that they have a free energy larger than that of the target
structure.

Accordingly, competitors are eliminated due to two equally important criteria: incompatibility as
well as minimum free energy considerations.

If the distance of Cross($\lambda^i$) to $T$ is less than or equal to $d_{\min} + 5$, we return to
Step I (with $\lambda^i$). Otherwise, we repeat Step III (for at most 5 times), thereby generating
$\lambda^i_1, \dots, \lambda^i_5$, and set $\lambda^i = \lambda^i_w$, where
$d(\mathrm{Cross}(\lambda^i_w), T)$ is minimal.

The procedure Adjust-Seq employs the negative paradigm [17] in order to exclude energetically close
conformations. It returns the sequence seq_middle which is tailored to realize the target structure
as mfe-fold.



Algorithm 2 Adjust-Seq
Input: the original start sequence start
Input: the target structure $T$
Output: an initialized sequence seq_middle
 1: $n$ ← length of $T$
 2: $d_{\min}$ ← $+\infty$, seq_min ← start
 3: for $i = 1$ to $\frac{1}{2}\sqrt{n}$ do
 4:     ▷ Step I: generate the set $C^0(\lambda^{i-1})$ via Cross
 5:     $C^0(\lambda^{i-1})$ ← Cross($\lambda^{i-1}$, $N$)
 6:     $d$ ← $d(C^0_0(\lambda^{i-1}), T)$
 7:     if $d = 0$ then
 8:         return $\lambda^{i-1}$
 9:     else if $d < d_{\min}$ then
10:         $d_{\min}$ ← $d$, seq_min ← $\lambda^{i-1}$
11:     end if
12:
13:     ▷ Step II: generate the competitor set $C(\lambda^{i-1})$
14:     $C^1(\lambda^{i-1})$ ← $\emptyset$
15:     for all $C^0_h(\lambda^{i-1}) \in C^0(\lambda^{i-1})$ do
16:         for all arcs $\alpha_j$ of $C^0_h(\lambda^{i-1})$ do
17:             $C^1(\lambda^{i-1})$ ← $C^1(\lambda^{i-1}) \cup \nu(C^0_h(\lambda^{i-1}), \alpha_j)$
18:         end for
19:     end for
20:     $C(\lambda^{i-1})$ =
21:         $\{C^1_h(\lambda^{i-1}) \in C^1(\lambda^{i-1}) : C^1_h(\lambda^{i-1})$ is valid$\}$
22:
23:     ▷ Step III: mutation
24:     seq ← $\lambda^{i-1}$
25:     for $w = 1$ to $n$ do
26:         if $\exists\, C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$ s.t. $p(C_h, w) \ne p(T, w)$ then
27:             seq[$w$] ← random nucleotide or pair, s.t. $\forall\, C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$, seq $\in C[T]$ and seq $\notin C[C_h(\lambda^{i-1})]$
28:         end if
29:     end for
30:     $T_{seq}$ ← Cross(seq)
31:     if $d(T_{seq}, T) \le d_{\min} + 5$ then
32:         seq_middle ← seq
33:     else if Step III run less than 5 times then
34:         goto Step III
35:     end if
36: end for    ▷ loop to line 3
37:
38: return seq_middle

3.2 Decompose and Local-Search

In this section we introduce the two routines Decompose and Local-Search. The routine Decompose
partitions $T$ into linearly ordered, energy independent components, see Figure 12 and Section 2.1.
Local-Search iteratively constructs an optimal sequence for $T$ via local solutions that are
optimal for certain substructures of $T$.

Decompose: Suppose T is decomposed as fol- 
lows, 

B = {Ti, . . . , Tm'} ■ 

where the are the loops together with all arcs in 
the associated stems of the target. 

We define a linear order over B as follows: T^, < 
T/i if either 

1. is nested in Th, or 

2. the start-point of T^u precedes that of Th- 
in Figure [21] we display the linear order of the 

loops of the structure shown in Figure [T2l 
Next we define the interval a_w = [l(T_w), r(T_w)], obtained by projecting the loop T_w onto the interval [l(T_w), r(T_w)], and b_w ⊇ a_w, being the maximal interval consisting of a_w and its adjacent unpaired consecutive nucleotides, see Figure 12. Given two consecutive loops T_w < T_{w+1}, we have two scenarios:

• either b_w and b_{w+1} are adjacent, see b_5 and b_6 in Figure 21,







Figure 21: Linear ordering of loops: a_1 = [11,19], b_1 = [10,20], a_2 = [7,37], b_2 = [5,39], a_3 = [21,42], b_3 = [20,44], a_4 = [25,47], b_4 = [24,48], a_5 = [7,47], b_5 = [5,48], a_6 = [49,57], b_6 = [48,59], a_7 = [1,63], b_7 = [1,65].







Figure 22: Loops and their induced sequence of inter- 
vals. 

• or b_w ⊂ b_{w+1}, see b_1 and b_2 in Figure 21.

Let c_w = ∪_{h=1}^{w} b_h, then we have the sequence of intervals a_1, b_1, c_1, ..., a_{m'}, b_{m'}, c_{m'}. If there are no unpaired nucleotides adjacent to a_w, then a_w = b_w and we simply delete all such b_w. Thereby we derive the sequence of intervals I_1, I_2, ..., I_m. In Figure 22 we illustrate how to obtain this interval sequence: here the target decomposes into the loops T_1, T_2 and we have I_1 = [3,5], I_2 = [3,6], I_3 = [2,9], and I_4 = [1,10].
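The derivation of the interval sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used by Inv: the loop spans a_w are taken as given and already ordered, c_w is computed as the smallest interval covering b_1, ..., b_w, and an interval is dropped whenever it repeats its predecessor (which covers the case a_w = b_w and, as our own assumption, also the case c_w = b_w). The set of unpaired positions in the example is hypothetical and chosen so that the output matches the intervals quoted for Figure 22.

```python
def extend_over_unpaired(a, unpaired, n):
    """Grow the interval a = (left, right) over adjacent unpaired positions (1-based)."""
    left, right = a
    while left > 1 and (left - 1) in unpaired:
        left -= 1
    while right < n and (right + 1) in unpaired:
        right += 1
    return (left, right)


def interval_sequence(loop_spans, unpaired, n):
    """Build a_1, b_1, c_1, ..., dropping any interval equal to its predecessor."""
    result = []
    cover = None  # plays the role of c_w: covers b_1, ..., b_w
    for a in loop_spans:
        b = extend_over_unpaired(a, unpaired, n)
        if cover is None:
            cover = b
        else:
            cover = (min(cover[0], b[0]), max(cover[1], b[1]))
        for interval in (a, b, cover):
            if not result or result[-1] != interval:
                result.append(interval)
    return result


# Two loops with spans [3,5] and [2,9] on 10 nucleotides (hypothetical unpaired set).
print(interval_sequence([(3, 5), (2, 9)], {1, 4, 6, 10}, 10))
```

With this input the sketch yields the four intervals [3,5], [3,6], [2,9], [1,10].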
Local-Search: Given the sequence of intervals I_1, I_2, ..., I_m, we proceed by performing a local stochastic search on the subsequences seq|_{I_1}, seq|_{I_2}, ..., seq|_{I_m} (initialized via seq = seq_middle, and where s|_{[x,y]} = s_x s_{x+1} ... s_y). When we perform the local search on seq|_{I_w}, only positions that contribute to the distance to the target, see Figure m or positions adjacent to the latter, will be altered. We use the arrays U_1, U_2 to store the unpaired and paired positions of T. In this process, we allow for mutations that increase the structure distance by less than five with probability 0.1. The latter parameter is heuristically determined. We iterate this routine until the distance is either zero or some halting criterion is met.
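The acceptance rule for a mutated sequence and the restriction s|_{[x,y]} can be sketched as follows. This is an illustration under our own assumptions: the function names are hypothetical, the structure distances d and d_min are assumed to be computed elsewhere, and the random source is passed in so that the rule can be tested deterministically.

```python
import random


def restrict(s, x, y):
    """Return s|[x,y] = s_x s_{x+1} ... s_y for 1-based, inclusive positions."""
    return s[x - 1:y]


def accept(d, d_min, rng=random, slack=5, p_worse=0.1):
    """Decide whether a mutation with distance d restarts the search.

    Improvements (d < d_min) are always accepted. Mutations with
    d_min < d < d_min + slack are accepted with probability p_worse.
    Everything else, including d == d_min, is not accepted here.
    """
    if d < d_min:
        return True
    if d_min < d < d_min + slack:
        return rng.random() < p_worse
    return False


print(restrict("GCAUGC", 2, 4))
print(accept(3, 5))
```

The default values 5 and 0.1 are the ones stated in the text. Sequences with d equal to d_min are not accepted by this rule because the text collects them separately and later selects the one with the lowest mfe.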

Algorithm 3 Local-Search

Input: seq_middle
Input: the target T
Output: seq
Ensure: Cross(seq) = T

seq ← seq_middle
if Cross(seq) = T then
    return seq
end if
decompose T and derive the ordered intervals
I ← [I_1, I_2, ..., I_m]
for all I_w in I do
    ▷ Phase I: Identify positions.
    d_min = d(Cross(seq|_{I_w}), T|_{I_w})    ▷ initialize d_min
    derive U_1 via Cross(seq|_{I_w}), T|_{I_w}
    derive U_2 via Cross(seq|_{I_w}), T|_{I_w}
    ▷ Phase II: Test and Update.
    for all p in U_1 do
        random T-compatible mutate seq_p
    end for
    for all [p, q] in U_2 do
        random T-compatible mutate seq_p
    end for
    E ← ∅
    for all p ∈ U_1, U_2 do
        d ← d(T, Cross(seq_p))
        if d < d_min then
            d_min ← d, seq ← seq_p
            goto Phase I
        else if d_min < d < d_min + 5 then
            goto Phase I with the probability 0.1
        end if
        if d = d_min then
            E ← E ∪ {seq}
        end if
    end for
    seq ← e_0 ∈ E, where e_0 has the lowest mfe in E
    if Phase I run less than 10n times then
        goto Phase I
    end if
end for
return seq

4 Discussion

The main result of this paper is the presentation of the algorithm Inv, freely available at

http://www.combinatorics.cn/cbpc/inv.html

Its input is a 3-noncrossing RNA structure T, given in terms of its base pairs (i_1, i_2) (where i_1 < i_2). The output of Inv is an RNA sequence s = (s_1 s_2 ... s_n), where s_h ∈ {A, C, G, U}, with the property Cross(s) = T, see Figure 23.

GUUGOGGUGOGGUAAUGACUGUCAGCAGAAACCUCGACUGUGGGGGAGGUUUOUGA
GUGGAGACAGAGCGUUACGCUCCAACUGUAUGGGGGGUCUUUGGGCUOCAUGUAGG
CGCGOGUGGUGUAUGUCGGAGACGGUGGGGOOCGGGUGCGUGUAACUGGGCGUUAA

Figure 23: UTR pseudoknot of bovine coronavirus [39]: its diagram representation and three sequences of its neutral network as constructed by Inv.

The core of Inv is a stochastic local search routine which is based on the fact that each 3-noncrossing RNA structure has a unique loop-decomposition, see Theorem 1 in Section 2.1. Inv generates "optimal" subsequences and eventually arrives at a global solution for T itself. Inv generalizes the existing inverse folding algorithms by considering arbitrary 3-noncrossing canonical pseudoknot structures. Conceptually, Inv differs from INFO-RNA in how the start sequence is generated and in the particulars of the local search itself.
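The input convention stated above, a list of base pairs (i_1, i_2) with i_1 < i_2 forming a 3-noncrossing structure, can be checked with a short sketch. This is our own illustration and not part of Inv: the function names are hypothetical, two arcs (i, j) and (k, l) are taken to cross if i < k < j < l, and canonicity (the absence of isolated arcs) is not checked.

```python
from itertools import combinations


def crosses(a, b):
    """True if the arcs a and b cross, i.e. i < k < j < l after ordering by start."""
    (i, j), (k, l) = sorted((a, b))
    return i < k < j < l


def is_3_noncrossing(pairs):
    """True if no three arcs are mutually crossing."""
    return not any(
        crosses(a, b) and crosses(a, c) and crosses(b, c)
        for a, b, c in combinations(pairs, 3)
    )


def validate_target(pairs, n):
    """Check the base-pair list of a target structure on n nucleotides."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            raise ValueError(f"invalid base pair ({i}, {j})")
        if i in used or j in used:
            raise ValueError(f"position paired twice in ({i}, {j})")
        used.update((i, j))
    if not is_3_noncrossing(pairs):
        raise ValueError("structure contains three mutually crossing arcs")
    return True


# A hypothetical H-type pseudoknot: two crossing arcs, hence 3-noncrossing.
print(validate_target([(1, 5), (3, 8)], 8))
```

The check over all triples is cubic in the number of arcs, which is acceptable for an input test on structures of the sizes discussed here.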

As discussed in the introduction, an argument is needed as to why the inverse folding of pseudoknot RNA structures works. While folding maps into RNA secondary structures are well understood, the generalization to 3-noncrossing RNA structures is nontrivial. However, the combinatorics of RNA pseudoknot structures [28, 29, 40] implies the existence of large neutral networks, i.e. networks composed of sequences that all fold into a specific pseudoknot structure. Therefore, the fact that it is indeed possible to generate via Inv sequences contained in the neutral networks of targets against competing pseudoknot configurations, see Figure 23 and Figure 24, confirms the predictions of [32].

AUACGACAUCGUAACUUCCUACUCGUUGUGGAACUGGCCGGGAGC
CGGUCUCAGGAGCGAAUGGGUUAGGGGGCUCACGCGCUGUCAUUG
GUUGGUCCUAUCGACAGCCUGAGAGGUCAGAAAGAGAGCGGUUGC

Figure 24: The pseudoknot PKI of the internal ribosomal entry site (IRES) region [?T]: its diagram representation and three sequences of its neutral network as constructed by Inv.

An interesting class are the 3-noncrossing nonplanar pseudoknot structures. A nonplanar pseudoknot structure is a 3-noncrossing structure which is not a bi-secondary structure in the sense of Stadler [31]. That is, it cannot be represented by noncrossing arcs using the upper and lower half planes. Since DP-folding paradigms of pseudoknot folding are based on gap-matrices [16], the minimal class of "missed" structures¹ are exactly these nonplanar, 3-noncrossing structures. In Figure 25 we showcase a nonplanar RNA pseudoknot structure and 3 sequences of its neutral network, generated by Inv.
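The nonplanarity criterion described above can be tested with a short sketch. This is our own illustration, not a routine of Inv: a structure is bi-secondary exactly when its arcs can be split into two classes (upper and lower half plane) so that arcs of the same class do not cross, which we check by 2-colouring the arcs along crossings. The function names and the five-arc example are hypothetical.

```python
from collections import deque
from itertools import combinations


def crosses(a, b):
    """True if the arcs a and b cross, i.e. i < k < j < l after ordering by start."""
    (i, j), (k, l) = sorted((a, b))
    return i < k < j < l


def is_3_noncrossing(pairs):
    """True if no three arcs are mutually crossing."""
    return not any(
        crosses(a, b) and crosses(a, c) and crosses(b, c)
        for a, b, c in combinations(pairs, 3)
    )


def is_bisecondary(pairs):
    """True if the arcs can be 2-coloured so that crossing arcs get different colours."""
    colour = {}
    for start in pairs:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in pairs:
                if b == a or not crosses(a, b):
                    continue
                if b not in colour:
                    colour[b] = 1 - colour[a]
                    queue.append(b)
                elif colour[b] == colour[a]:
                    return False
    return True


# Five arcs whose crossings form a cycle of length five (hypothetical example).
five_cycle = [(1, 4), (3, 6), (5, 8), (7, 10), (2, 9)]
print(is_3_noncrossing(five_cycle), is_bisecondary(five_cycle))
```

In the example each arc crosses exactly two others and no three arcs cross mutually, so the structure is 3-noncrossing, while the odd cycle of crossings rules out a drawing with two half planes.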

As for the complexity of Inv, the determining factor is the subroutine Local-Search. Suppose that the target is decomposed into m intervals of lengths ℓ_1, ..., ℓ_m. For each interval, we may assume that line 2 of Local-Search runs f_h times, and that line 14 is executed g_h times. Since Local-Search will stop (line 4) if T_start = T (line 3), the remainder of Local-Search, i.e. lines 7 to 41, runs (f_h − 1) times, each such execution having complexity O(ℓ_h). Therefore we arrive at the complexity

m 

where c(ℓ) denotes the complexity of Cross. The multiplicities f_h and g_h depend on various factors, such as the start sequence, the random order of the elements of U_1, U_2 (see Algorithm 3) and the probability p.

¹ given the implemented truncations

UCCGCAUCGUCAAUCCCCUACUUAUAGUAUUGAUGGCGGCACAUUAUAAAUGUGGGGUGCUGCAAUCUCGCUGGGAUCUCAGGGG
GCCUGAGGGCUUAUGUUCCCUAAUCCUAAUGAGCCAGUGAUGUAGGAUUUUUAGGCUGUCACUACCAGCGUUGCUGGCUAGGAAU
UACCUAGGACCUGUUGGCGAUCCUGGACACAGGUCAGUGGGCGUCCAGGCUAGGUAGCCUGCUGUCCGAACUUUGGAAGACGUCA

Figure 25: A nonplanar 3-noncrossing RNA structure together with three sequences realizing it as mfe-structure.

According to [33], the complexity c(ℓ_h) is exponential in ℓ_h, and accordingly the complexity of Inv is given by



h=l 

In Figure 26 we present the average inverse folding time of several natural RNA structures taken from the PKdatabase [42]. These averages are computed by generating 200 sequences of the target's neutral network. In addition, we present in Table 1 the total time for 100 executions of Inv for an additional set of RNA pseudoknot structures.
RNA structure   length   trials   total time   success rate
                40       100      4m 57.81s    100%
EC.PK2 m]                100      5m 33.28s    100%
PMWaV-2 gS]     62       100      1m 7.12s     100%
tRNA            76       100      5m 2.49s     100%

Table 1: Inverse folding times for 100 executions of Inv for various RNA pseudoknot structures. In all cases, all trials successfully generated sequences of the respective neutral networks.